4.2 Testing the Difference of Proportions

conventional methods such as difference of two proportions (here, the proportions are the estimated generalization accuracies from a test set)

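Note: this is the conventional approach of taking the difference of two proportions, applied here to compare classification models. Each proportion is a model's generalization accuracy as estimated from the test set.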

z-score test for two population proportions

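If the 95% confidence intervals of the two models' accuracies do not overlap, reject the null hypothesis that the two classifiers perform equally well, at the 5% significance level (α = 0.05).

(Reminder: α = 5% means that when there is truly no difference, we would still wrongly reject about 1 time in 20.)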

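This test tends to have a high false positive rate (incorrectly detecting a difference when the two models do not differ), so it is not recommended in practice.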
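The interval-overlap check described above can be sketched as follows. This is a minimal illustration, assuming the normal-approximation interval ACC ± 1.96 · sqrt(ACC(1 − ACC)/n); the function names and the example accuracies (0.90 and 0.85 on 1000 test examples) are hypothetical and not taken from the source.

```python
import math

def accuracy_ci(acc, n, z=1.96):
    """Normal-approximation confidence interval for an accuracy
    estimated on n test examples (z=1.96 gives roughly 95%)."""
    half_width = z * math.sqrt(acc * (1.0 - acc) / n)
    return acc - half_width, acc + half_width

def intervals_overlap(ci_a, ci_b):
    """True if the two closed intervals share at least one point."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]

# Hypothetical accuracies of two models on a test set of 1000 examples.
ci_1 = accuracy_ci(0.90, 1000)
ci_2 = accuracy_ci(0.85, 1000)

# No overlap would lead to rejecting the null hypothesis at alpha = 0.05.
print(ci_1, ci_2, intervals_overlap(ci_1, ci_2))
```

The normal approximation is only reasonable when n is large and the accuracy is not too close to 0 or 1.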

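Procedure (applies to tests other than the z-score test as well)

1. Formulate the hypothesis to be tested

Null hypothesis and alternative hypothesis

2. Decide on a significance threshold

e.g., the 5% level mentioned above

3. Compute the test statistic and compare its associated p-value (probability) against the threshold

e.g., the z-score

4. Reject or fail to reject the null hypothesis at the chosen significance level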
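The four steps above can be sketched for the two-proportion z-test as follows. This is a rough illustration, assuming independent test sets of sizes n_1 and n_2, a two-sided alternative, and the pooled-proportion standard error; the function name and example numbers are hypothetical.

```python
import math
from statistics import NormalDist

def two_proportion_z_test(acc_1, n_1, acc_2, n_2, alpha=0.05):
    """Two-sided z-test for the difference of two proportions.

    Step 1: H0 says the two accuracies are equal; H1 says they differ.
    Step 2: alpha is the significance threshold chosen in advance.
    Assumes 0 < pooled accuracy < 1, otherwise the standard error is zero.
    """
    # Step 3: test statistic under H0, using the pooled proportion.
    pooled = (acc_1 * n_1 + acc_2 * n_2) / (n_1 + n_2)
    std_err = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_1 + 1.0 / n_2))
    z = (acc_1 - acc_2) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    # Step 4: reject H0 if the p-value falls below the threshold.
    return z, p_value, p_value < alpha

# Hypothetical accuracies measured on two test sets of 1000 examples each.
z, p_value, reject = two_proportion_z_test(0.90, 1000, 0.85, 1000)
print(f"z = {z:.3f}, p = {p_value:.5f}, reject H0: {reject}")
```

Fixing alpha before looking at the data (step 2) is what makes the decision in step 4 meaningful.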

Comment: as long as the test sets are independent, their sizes may differ (n1, n2).

due to using the same test set (and violating the independence assumption) we have n1 = n2 = n,

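That is, both models are evaluated on the same test set (which violates the independence assumption), so the two sample sizes coincide: n1 = n2 = n.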

if |z| is higher than 1.96

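In that case, reject the null hypothesis (no difference in generalization performance between the two models) at the 5% significance level (95% confidence).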
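For reference, the statistic being compared against 1.96 can be written out as below. This is a sketch assuming the pooled-variance form of the two-proportion z statistic with n1 = n2 = n; the symbols ACC_1, ACC_2 (the two estimated accuracies) and ACC_{1,2} (their average) are notation assumed here, since the notes above do not spell out the formula.

```latex
z = \frac{ACC_1 - ACC_2}{\sqrt{2\sigma^2}}
  = \frac{ACC_1 - ACC_2}{\sqrt{2 \cdot ACC_{1,2}\,(1 - ACC_{1,2}) / n}},
\qquad
ACC_{1,2} = \frac{ACC_1 + ACC_2}{2}
```

The threshold 1.96 is the two-sided critical value of the standard normal distribution for α = 0.05.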

The problem with this test though is that we use the same test set to compute the accuracy of the two classifiers

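That is, computing both accuracies on the same test set violates the independence assumption of the z-score test.

A paired test is preferable in this setting.

Next, section 4.3 covers the McNemar test ("a robust alternative to the paired test").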